Validation methods for aggregate-level test scale linking:
A case study mapping school district test score distributions to a common scale


Abstract 

Linking score scales across different tests is considered speculative and fraught, even at the 
aggregate level (Feuer et al., 1999; Thissen, 2007). We introduce and illustrate validation methods for 
aggregate linkages, using the challenge of linking U.S. school district average test scores across states as a 
motivating example. We show that aggregate linkages can be validated both directly and indirectly under 
certain conditions, such as when the scores for at least some target units (districts) are available on a
common test (e.g., the National Assessment of Educational Progress). We introduce precision-adjusted 
random effects models to estimate linking error, for populations and for subpopulations, for averages and 
for progress over time. These models allow us to distinguish linking error from sampling variability and 
illustrate how linking error plays a larger role in aggregates with smaller sample sizes. Assuming that 
target districts generalize to the full population of districts, we can show that standard errors for district 
means are generally less than 0.2 standard deviation units, leading to reliabilities above 0.7 for roughly 
90% of districts. We also show how sources of imprecision and linking error contribute to both within-
and between-state district comparisons. This approach is applicable whenever the essential
counterfactual question, “what would means/variance/progress for the aggregate units be,
had students taken the other test?”, can be answered directly for at least some of the units.


Keywords: linking, scaling, multilevel modeling, achievement testing, NAEP 


Introduction 

As educational testing programs proliferate, non-overlapping populations and incomparable 
scales can limit the scope of research about the correlates and causes of educational achievement. 
Linking is the psychometric solution to this problem. Common persons, common populations, or common 
items across tests form the basis for estimated linking functions (Kolen & Brennan, 2014). These functions 
can enable mappings of scores from various tests to a common scale, enabling large-scale research about 
educational achievement. However, the bases for these linkages—common persons, populations, or 
items—are not always available at a large scale. When they are available, methods for evaluating the 
linkage for the purpose of large-scale research, rather than student-level uses like diagnosis and selection, 
are still in development (Thissen, 2007). 

Dorans and Holland (2000) outline five requirements for equating: (1) equal constructs, (2) equal 
reliability, (3) examinee indifference between tests, and (4) a symmetrical linking function that is (5) 
invariant across populations. These requirements are only realistically met within testing programs, not 
across them. For linkages that do not meet the stringent conditions of equating, the appropriateness of 
the linkage becomes dependent on the interpretations and uses of the linked scores. 

We present a case of aggregate-level linking whose purpose is to support educational research. 
First, we show how a common assessment at one level of aggregation (the state, in our example) can 
serve as the basis for a common-population linkage. Second, we demonstrate how the assessment can 
directly validate the linkage on which it is based, if the assessment also reports scores at a lower level of 
aggregation (the school district, here). Third, we show how to validate inferences about progress over 
time in addition to inferences about relative achievement. Fourth, we show how additional assessments
that are common across a subset of the lower-level units can provide indirect validation of the linking.
Although none of the methods we present is new on its own, the logic and methods of this validation
approach are likely to be useful in other aggregate linking scenarios.


A case comparing U.S. school district achievement across states 

To understand how a “patchwork” administration of tests can support aggregate linking, we 
present the case of linking U.S. school district average scores to a common scale. U.S. school districts 
differ dramatically in their socioeconomic and demographic characteristics (Reardon, Yun, & Eitle, 1999; 
Stroub & Richards, 2013), and districts have considerable influence over instructional and organizational 
practices that may affect academic achievement (Whitehurst, Chingos, & Gallaher, 2013). Nonetheless, 
we have relatively little rigorous large-scale research describing national patterns of variation in 
achievement across districts, let alone an understanding of the factors that cause this variation. Such 
analyses generally require district-level test score distributions that are comparable across states. No
such nationwide, district-level achievement dataset currently exists, because districts in different
states do not administer a common set of assessments.

Existing assessments enable some comparisons of academic performance across states or school 
districts, but none provides comprehensive comparisons across grades, years, and all school districts. At 
the highest level, the National Assessment of Educational Progress (NAEP) provides comparable state- 
level scores in odd years, in reading and mathematics, in grades 4 and 8. NAEP also provides district-level 
scores, but only for a small number of large urban districts under the Trial Urban District Assessment 
(TUDA) initiative: TUDA began with 6 districts in 2002 and slowly grew to 27 districts by 2017. Within 
individual states, we can compare district achievement within a given grade and year using state math 
and reading/English Language Arts (ELA) tests federally mandated by the No Child Left Behind (NCLB) Act,
administered annually in grades 3-8. Comparing academic achievement across state lines requires either
that districts administer a common test, or that the scores on the state tests can be linked to a common
scale. However, state accountability tests generally differ across states. Each state develops and
administers its own tests; these tests may assess somewhat different content domains; scores are 
reported on different, state-determined scales; and proficiency thresholds are set at different levels of 
achievement. Moreover, the content, scoring, and definition of proficiency may vary within any given 
state over time and across grades. 

As a result, direct comparisons of average scores or percentages of proficient students across
states (or in many cases within states, across grades and years) are unwarranted and misleading. Average 
scores may differ because scales differ and because performance differs. Proficiency rates may differ 
because proficiency thresholds differ (Bandeira de Mello, Blankenship, & McLaughlin; Braun & Qian, 
2007) and because performance differs. The ongoing rollout of common assessments developed by 
multistate assessment consortia (such as the Partnership for Assessment of Readiness for College and 
Careers, PARCC, and the Smarter Balanced Assessment Consortium, SBAC) is certainly increasing 
comparability across states, but only to the extent that states use these assessments. Customization of 
content standards by states may also discourage the reporting of results on a common scale across states 
(Gewertz, 2015; U.S. Department of Education, 2009). Given the incomplete, divided, and declining state 
participation in these consortia, comprehensive, directly comparable district-level test score data in the 
U.S. remain unavailable. 

In some cases, districts also administer voluntarily-chosen assessments, often for lower-stakes 
purposes. When two districts adopt the same such assessments, we can compare test scores on these 
assessments among districts. One of the most widely used assessments, the Measures of Academic 
Progress (MAP) test from Northwest Evaluation Association (NWEA), is voluntarily administered in several 
thousand school districts, over 20% of all districts in the country. However, the districts using MAP are 
neither a representative nor comprehensive sample of districts. 


In this paper, we present a validation strategy for comparisons of district-level test scores across
states, years, and grades. We rely on a combination of a) population-level state test score data from NAEP
and state tests; b) linear transformations that link state test scores to observed and interpolated NAEP
scales; and c) a set of validation checks to assess the accuracy of the resulting linked estimates. In
addition, we provide formulas to quantify the uncertainty in both between- and within-state
comparisons. Together, these represent a suite of approaches for constructing and evaluating linked
estimates of test score distributions.

We use data from the EDFacts Initiative (U.S. Department of Education, 2015), NAEP, and NWEA. 
We obtain population-level state testing data from EDFacts; these data include counts of students in 
ordered proficiency categories for each district-grade-year-subject combination. We fit heteroskedastic 
ordered probit (HETOP) models to these district proficiency counts, resulting in estimated district means 
and variances on a state standardized (zero mean and unit variance) scale (Reardon, Shear, Castellano, & 
Ho, 2016). We then apply linear linking methods that adjust for test reliability (reviewed by Kolen and 
Brennan, 2014) to place each district’s estimated score distribution parameters on a common national 
scale. Our linking methods are similar to those that Hanushek and Woessmann (2012) used to compare
country-level performance internationally. At the district level (Greene & McGee, 2011) and school level
(Greene & Mills, 2014), the Global Report Card (GRC) maps scores onto a national scale using proficiency
rates, using a somewhat different approach than ours.¹ What we add to these standard linear linking
methods are direct and indirect validation methods that take advantage of patchwork reporting of test
scores at the target levels of aggregation. We also develop an approach to assessing the uncertainty in
linked estimates resulting from both measurement error and linking error.


1 Our data and methods are more comprehensive than those used in the GRC (Greene & McGee, 2011; Greene & 
Mills, 2014; http://globalreportcard.org/). First, we provide grade-specific estimates (by year), allowing for estimates 
of measures of progress. Second, instead of the statistical model we describe below (Reardon, Shear, Castellano, & 
Ho, 2016), which leverages information from three cut scores in each grade, the GRC uses only one cut score and 
aggregates across grades. This assumes that stringency is the same across grades and that district variances are 
equal. Third, our methods allow us to provide standard errors for our estimates. Fourth, we provide both direct and 
indirect validation checks for our linkages. 


Although some have argued that using NAEP as a basis for linking state accountability tests is 
both infeasible and inappropriate for high-stakes student-level reporting (Feuer, Holland, Green, 
Bertenthal, & Hemphill, 1999), our goal here is different. We do not attempt to estimate student-level 
scores, and we do not intend the results to be used for high-stakes accountability. Rather, our goal is to
estimate transformations that render aggregate test score distributions roughly comparable across
districts in different states, so that the resulting district-level distributions can be used in aggregate-level
research. We grant that NAEP and state tests may differ in many respects, including content, testing 
dates, motivation, accommodations for language, accommodations for disabilities, and test-specific 
preparation. While accepting these sources of possible linking error, we focus on the counterfactual 
question that linking asks: How well do our linked district scores from state tests recover the average 
NAEP scores that these districts would have received, had their students taken NAEP? In this way, we 
treat the issue of feasibility empirically, by using validation checks to assess the extent to which our 
methods yield unbiased estimates of aggregate parameters of interest. 

In any situation where states have some subset of districts with common scores on an alternative 
test, it is unlikely that those districts will be representative given their selection into the alternative test. 
When districts are representative, the methods we introduce will provide estimates of linking errors for 
the full population of districts. When districts are not representative, these methods provide estimates of 
linking errors for the particular subpopulations of districts they represent. In these latter cases, these 
methods provide evidence about the validity of the linkage for a subpopulation, a recommended step in
the validation of any linking (Dorans & Holland, 2000).


Data 
We use state accountability test score data and state NAEP data to link scores, and we use NAEP
TUDA data and NWEA MAP data to evaluate the linkage. Under the EDFacts Initiative (U.S. Department of
Education, 2015), states report frequencies of students scoring in each of several ordered proficiency
categories for each tested school, grade, and subject (mathematics and reading/ELA). The numbers of
ordered proficiency categories vary by state, from 2 to 5, most commonly 4. We use EDFacts data from
2009 to 2013, in grades 3-8, provided to us by the National Center for Education Statistics under a
restricted data use license. These data are not suppressed and have no minimum cell size. We also use
reliability estimates collected from state technical manuals and reports for these same years and grades.²

Average NAEP scores and their standard deviations are reported for states and participating
TUDA districts in odd years, in grades 4 and 8, in reading and mathematics. In each state and TUDA
district, these scores are based on an administration of the NAEP assessments to representative samples
of students in the relevant grades and years. We use years 2009, 2011, and 2013 as a basis for linking,
which include 17 TUDA districts in 2009 and 20 TUDA districts in 2011 and 2013.³ The NAEP state and
district means and standard deviations, as well as their standard errors, are available from the NAEP Data 
Explorer (U.S. Department of Education, n.d.). To account for NAEP initiatives to expand and standardize 
inclusion of English learners and students with disabilities over this time period, we rely on the Expanded 
Population Estimates (EPE) of means and standard deviations provided by the National Center for
Education Statistics (see Braun, Zhang, & Vezzu, 2008; McLaughlin, 2005; National Institute of Statistical
Sciences, 2009).⁴


2 From 2009-2013, 63% of 3,060 state(51)-grade(6)-subject(2)-year(5) reliability coefficients were available.
Consistent with Reardon and Ho (2015), the reported reliabilities have a mean of 0.905 and a standard deviation of 
0.025. Missing reliabilities were imputed as predicted values from a linear regression of reliability on state-grade- 
subject and state-year fixed effects. The residuals from the model have a standard deviation of 0.010. This suggests 
that imputation errors based on the model have a standard deviation of 0.010, which is very small compared to the 
mean value of the reliabilities (0.905). As a result, imputation errors in reliability are not likely to be consequential.

3 In 2009, the 17 districts are Atlanta, Austin, Baltimore, Boston, Charlotte, Chicago, Cleveland, Detroit, Fresno,
Houston, Jefferson County, Los Angeles, Miami, Milwaukee, New York City, Philadelphia, and San Diego. 
Albuquerque, Dallas, and Hillsborough County joined in 2011 and 2013. Washington, DC is not included for 
validation, as it has no associated state for linking. California districts (and Texas districts in 2013) did not have a 
common Grade 8 state mathematics assessment, so the California and Texas districts lack a linked district mean for 
Grade 8 mathematics. 

4 Note that the correlations between EPE and regular NAEP estimates are near unity; as a result, our central substantive
conclusions are unchanged if we use the regular NAEP estimates in the linking. 


Finally, we use data from the NWEA MAP test that overlap with the years, grades, and subjects 
available in the EDFacts data: 2009-2013, grades 3-8, in reading/ELA and mathematics. Student-level MAP 
test score data (scale scores) were provided to us through a restricted-use data sharing agreement with 
NWEA. Several thousand school districts chose to administer the MAP assessment in some or all years 
and grades that overlap with our EDFacts data. Participation in the NWEA MAP is generally binary in 
districts administering the MAP; that is, in participating districts, either very few students or essentially all 
students are assessed. We exclude cases in any district’s grade, subject, and year, where the ratio of 
assessed students to enrolled students is lower than 0.9 or greater than 1.1. This eliminates districts with 
scattered classroom-level implementation as well as very small districts with accounting anomalies. 
Excluded districts comprise roughly 10% of the districts using the NWEA MAP tests. After these
exclusions, we estimate district-grade-subject-year means and standard deviations from student-level
data reported on the continuous MAP scale.
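As a minimal sketch of the exclusion rule and summary step just described, the following Python function processes one district-grade-subject-year cell. The function name, its arguments, and the use of the population standard deviation are illustrative assumptions, not details taken from our processing code.

```python
from statistics import mean, pstdev

def district_map_summary(scores, enrolled, lo=0.9, hi=1.1):
    """Summarize MAP scale scores for one district-grade-subject-year cell.

    scores   : list of student-level MAP scale scores (hypothetical input)
    enrolled : number of enrolled students in the same cell
    Returns (mean, sd), or None when the ratio of assessed to enrolled
    students is lower than `lo` or greater than `hi`.
    """
    if enrolled <= 0 or not scores:
        return None
    ratio = len(scores) / enrolled
    if ratio < lo or ratio > hi:
        return None  # scattered implementation or accounting anomaly
    # Assumption: population SD; a sample SD (statistics.stdev) is equally defensible.
    return mean(scores), pstdev(scores)
```

For example, a cell with three assessed students out of three enrolled is retained, whereas a cell with two assessed students out of ten enrolled is dropped.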


Linking Methods 

The first step in linking the state test scores to the NAEP scale is to estimate district-level score 
means and standard deviations. If individual scale score data or district-level means and standard 
deviations are available, one could simply use these to obtain the necessary district-level parameter 
estimates. Such data are not readily available in most states, however. Instead, we estimate the district 
score means and standard deviations from the coarsened proficiency count data available in EDFacts, 
using the methods described in detail by Reardon, Shear, Castellano, and Ho (2016). These parameters 
are scaled relative to the state-wide standardized score distribution on the state assessment. We do this 
in each state, separately for each grade, year, and subject. 

Reardon, Shear, Castellano, and Ho (2016) demonstrate that a heteroskedastic ordered probit (HETOP)
model can be used to estimate group (district) test score means and standard deviations from coarsened
data. The HETOP model assumes that there is some monotonic transformation of a state’s test scale in
which each district’s test score distribution is normal (with a district-specific mean and standard
deviation) and that the observed ordered proficiency counts in each district are the result of coarsening
the districts’ normal score distributions using a common set of proficiency thresholds. Given these
assumptions, the HETOP model provides estimates of each district’s score mean and standard deviation.
The resulting estimates are generally unbiased and are only slightly less precise than estimates obtained 
from (uncoarsened) student-level scale score data in typical state and national educational testing 
contexts. We refer readers to their paper (Reardon, Shear, Castellano, & Ho, 2016) for technical specifics. 
Because most states do not report district-level means and standard deviations, the ability to estimate 
these distributional parameters from coarsened proficiency category data is essential, given that such 
categorical data are much more readily available (e.g., EDFacts). Nonetheless, the method of constructing 
district-level score means and standard deviations is not central to the linking methods we discuss. The 
following methods are applicable whenever district means and standard deviations are available with 
corresponding standard errors. 
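To make the coarsening logic concrete, the sketch below generates expected category shares for a single district with a normal score distribution and then recovers the mean and standard deviation from those shares. It is a simplified illustration, not the HETOP estimator: it assumes the proficiency thresholds are already known on the standardized scale and fits a line to probit-transformed cumulative shares, whereas the HETOP model estimates thresholds and district parameters jointly by maximum likelihood. The function names and example values are hypothetical.

```python
from statistics import NormalDist

def coarsen(mu, sigma, cuts):
    """Expected share of students in each ordered proficiency category."""
    dist = NormalDist(mu, sigma)
    cum = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return [upper - lower for lower, upper in zip(cum, cum[1:])]

def recover(shares, cuts):
    """Recover (mu, sigma) from category shares, given known cut scores.

    Uses probit(P(X <= c)) = (c - mu) / sigma, which is linear in c with
    slope 1/sigma and intercept -mu/sigma. Requires at least two cut scores
    and cumulative shares strictly between 0 and 1.
    """
    std = NormalDist()
    running, z = 0.0, []
    for share in shares[:-1]:
        running += share
        z.append(std.inv_cdf(running))
    n = len(cuts)
    c_bar, z_bar = sum(cuts) / n, sum(z) / n
    sxx = sum((c - c_bar) ** 2 for c in cuts)
    sxz = sum((c - c_bar) * (v - z_bar) for c, v in zip(cuts, z))
    slope = sxz / sxx
    intercept = z_bar - slope * c_bar
    return -intercept / slope, 1.0 / slope

# Hypothetical district: mean 0.3, SD 0.8, three cut scores (four categories).
shares = coarsen(0.3, 0.8, [-1.0, 0.0, 1.0])
mu_hat, sigma_hat = recover(shares, [-1.0, 0.0, 1.0])
```

With exact expected shares the parameters are recovered essentially without error; with observed counts, sampling variability in the shares carries through to the estimates, which is why standard errors are needed.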

Fitting the HETOP model to EDFacts data yields estimates of each district’s mean test score,
where the means are expressed relative to the state’s student-level population mean of 0 and standard
deviation of 1, within each grade, year, and subject. We denote these estimated district means and
standard deviations as $\hat{\mu}^{state}_{dygb}$ and $\hat{\sigma}^{state}_{dygb}$, respectively, for district d, year y, grade g, and subject b. The
HETOP model estimation procedure also provides standard errors of these estimates, denoted $se(\hat{\mu}^{state}_{dygb})$
and $se(\hat{\sigma}^{state}_{dygb})$, respectively (Reardon, Shear, Castellano, & Ho, 2016).⁵


5 Note that, because there is measurement error in the state accountability test scores, estimates of $\mu_{dygb}$ and
$\sigma_{dygb}$ that are standardized based on the observed score distribution will be biased estimates of the means and
standard deviations expressed in terms of the standardized true score distribution (the means will be biased toward
0; the standard deviations will be biased toward 1). Before linking, we adjust $\hat{\mu}^{state}_{dygb}$ and $\hat{\sigma}^{state}_{dygb}$ to account for
measurement error using the classical definition of reliability as the ratio of true score variance over observed score
variance. We adjust the means and their standard errors by dividing them by the square root of the state test score
reliability ($\rho$) in the relevant year, grade, and subject. We adjust the standard deviations and their standard errors by
multiplying them by $\sqrt{\frac{\hat{\sigma}^2 + \rho - 1}{\rho \hat{\sigma}^2}}$. After these adjustments, $\hat{\mu}^{state}_{dygb}$ and $\hat{\sigma}^{state}_{dygb}$ are expressed in terms of the standardized
distribution of true scores within the state. We do not adjust NAEP state means and standard deviations, as NAEP
estimation procedures account for measurement error due to item assignment to examinees (Mislevy, Muraki, &
Johnson, 1992).

The second step of the linking process is to estimate a linear transformation linking each
state/year/grade/subject scale (standardized to a student-level mean of 0 and standard deviation of 1,
the scale of $\hat{\mu}^{state}_{dygb}$) to its corresponding NAEP distribution. Recall that we have estimates of NAEP means
and standard deviations at the state (denoted s) level, denoted $\hat{\mu}^{naep}_{sygb}$ and $\hat{\sigma}^{naep}_{sygb}$, respectively, as well as
their standard errors. To obtain estimates of these parameters in grades (3, 5, 6, and 7) and years (2010
and 2012) in which NAEP was not administered, we interpolate and extrapolate linearly.⁶ First, within
each NAEP-tested year, 2009, 2011, and 2013, we interpolate between grades 4 and 8 to grades 5, 6, and
7 and extrapolate to grade 3. Next, for all grades 3-8, we interpolate between the NAEP-tested years to
estimate parameters in 2010 and 2012. We illustrate this below for means, and we apply the same
approach to standard deviations.⁷ Note that this is equivalent to interpolating between years first and
then interpolating and extrapolating to grades.

$$\hat{\mu}^{naep}_{sygb} = \hat{\mu}^{naep}_{sy4b} + \frac{g-4}{4}\left(\hat{\mu}^{naep}_{sy8b} - \hat{\mu}^{naep}_{sy4b}\right), \text{ for } g \in \{3, 5, 6, 7\};\ y \in \{2009, 2011, 2013\};\ \text{and } \forall\, s, b$$

$$\hat{\mu}^{naep}_{sygb} = \frac{1}{2}\left(\hat{\mu}^{naep}_{s[y-1]gb} + \hat{\mu}^{naep}_{s[y+1]gb}\right), \text{ for } g \in \{3, 4, 5, 6, 7, 8\};\ y \in \{2010, 2012\};\ \text{and } \forall\, s, b \qquad (1)$$

We evaluate the validity of linking to interpolated NAEP grades and years explicitly later in this paper.

6 Interpolation relies on some comparability of NAEP scores across grades. Vertical linking was built into NAEP’s early
design via cross-grade blocks of items (administered to both 4th and 8th graders) in 1990 in mathematics and in 1992
in reading (Thissen, 2012). These cross-grade blocks act as the foundation for the bridge spanning grades 4 and 8. At
around that time, the National Assessment Governing Board that sets policy for NAEP adopted the position that, as
Haertel (1991) describes, “NAEP should employ within-age scaling whenever feasible” (p. 2). Thissen notes that
there have therefore been few checks on the validity of the cross-grade scales since that time. One exception is a
presentation by McClellan, Donoghue, Gladkova, and Xu (2005), who tested whether subsequent vertical linking
would have made a difference on the reading assessment. They concluded that “the current cross-grade scale
design used in NAEP seems stable to the alternate design studied” (p. 37). We do not use interpolation between
grades to explore growth trajectories on an absolute scale but rather to identify the relative positions of districts, which are
very consistent across grades. Our fourth validation check provides evidence that this interpolation is justified.

7 Note that the sampling variances of the interpolated means and standard deviations will be functions of the
sampling variances of the non-interpolated values. For example:

$$var\left(\hat{\mu}^{naep}_{sygb}\right) = \left(\frac{8-g}{4}\right)^2 var\left(\hat{\mu}^{naep}_{sy4b}\right) + \left(\frac{g-4}{4}\right)^2 var\left(\hat{\mu}^{naep}_{sy8b}\right), \text{ for } g \in \{3, 5, 6, 7\};\ y \in \{2009, 2011, 2013\};\ \text{and } \forall\, s, b$$

$$var\left(\hat{\mu}^{naep}_{sygb}\right) = \frac{1}{4}\left[var\left(\hat{\mu}^{naep}_{s[y-1]gb}\right) + var\left(\hat{\mu}^{naep}_{s[y+1]gb}\right)\right], \text{ for } g \in \{3, 4, 5, 6, 7, 8\};\ y \in \{2010, 2012\};\ \text{and } \forall\, s, b.$$
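The interpolation of NAEP state means across grades and years, together with the corresponding sampling variances, can be sketched in a few lines of Python. The function names and the numeric values in the example are hypothetical; the variance functions assume, as the variance expressions in the accompanying footnote do, that the estimates being combined are independent.

```python
def interp_grade(mean_g4, mean_g8, grade):
    """Linear interpolation (grades 5-7) or extrapolation (grade 3)
    of a NAEP state mean from the grade 4 and grade 8 means."""
    return mean_g4 + (grade - 4) / 4 * (mean_g8 - mean_g4)

def interp_grade_var(var_g4, var_g8, grade):
    """Sampling variance of the grade-interpolated mean, assuming the
    grade 4 and grade 8 estimates are independent."""
    return ((8 - grade) / 4) ** 2 * var_g4 + ((grade - 4) / 4) ** 2 * var_g8

def interp_year(mean_prev, mean_next):
    """Mean in an untested year (2010 or 2012) as the average of the
    adjacent NAEP-tested years."""
    return 0.5 * (mean_prev + mean_next)

def interp_year_var(var_prev, var_next):
    """Sampling variance of the year-interpolated mean."""
    return 0.25 * (var_prev + var_next)

# Hypothetical state with grade 4 mean 240 and grade 8 mean 280 in one year.
grade6_mean = interp_grade(240.0, 280.0, 6)
grade3_mean = interp_grade(240.0, 280.0, 3)
```

Because grade 3 lies outside the grade 4 to grade 8 range, its weight on the grade 4 estimate exceeds 1, so the extrapolated grade 3 variance is larger than that of either observed grade when the two input variances are equal.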


Because the estimated district test score moments $\hat{\mu}^{state}_{dygb}$ and $\hat{\sigma}^{state}_{dygb}$ are expressed on a state
scale with mean 0 and unit variance, the estimated mapping of the standardized test scale in state s, year
y, grade g, and subject b to the NAEP scale is given by Equation (2) below. Given $\hat{\mu}^{state}_{dygb}$, this mapping
yields an estimate of the district average performance on the NAEP scale, denoted $\hat{\mu}^{\widehat{naep}}_{dygb}$. Given
this mapping, the estimated standard deviation, on the NAEP scale, of scores in district d, year y, grade g,
and subject b is given by Equation (3).

$$\hat{\mu}^{\widehat{naep}}_{dygb} = \hat{\mu}^{naep}_{sygb} + \hat{\mu}^{state}_{dygb} \cdot \hat{\sigma}^{naep}_{sygb} \qquad (2)$$

$$\hat{\sigma}^{\widehat{naep}}_{dygb} = \hat{\sigma}^{state}_{dygb} \cdot \hat{\sigma}^{naep}_{sygb} \qquad (3)$$
The intuition behind Equation (2) is straightforward: districts that belong to states with relatively
high NAEP averages, $\hat{\mu}^{naep}_{sygb}$, should be placed higher on the NAEP scale. Within states, districts that are
high or low relative to their state (positive and negative on the standardized state scale) should be
relatively high or low on the NAEP scale in proportion to that state’s NAEP standard deviation, $\hat{\sigma}^{naep}_{sygb}$.

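From Equations (2) and (3), we can derive the (squared) standard errors of the linked means and
standard deviations for non-interpolated grades and years, incorporating the imprecision from the
estimates of state and NAEP means and standard deviations. In these derivations, we assume that the
linking assumption is met. Later in this paper, we relax this assumption and provide expressions for the
standard errors of the linked means that include the linking error. For simplicity in these derivations, we
assume $\hat{\mu}^{naep}_{sygb}$ and $\hat{\sigma}^{naep}_{sygb}$ are independent random variables,⁸ which yields:

$$var\left(\hat{\mu}^{\widehat{naep}}_{dygb}\right) = var\left(\hat{\mu}^{naep}_{sygb}\right) + \left(\hat{\sigma}^{naep}_{sygb}\right)^2 var\left(\hat{\mu}^{state}_{dygb}\right) + \left(\hat{\mu}^{state}_{dygb}\right)^2 var\left(\hat{\sigma}^{naep}_{sygb}\right) + var\left(\hat{\sigma}^{naep}_{sygb}\right) var\left(\hat{\mu}^{state}_{dygb}\right); \qquad (4)$$

$$var\left(\hat{\sigma}^{\widehat{naep}}_{dygb}\right) = var\left(\hat{\sigma}^{state}_{dygb}\right)\left[var\left(\hat{\sigma}^{naep}_{sygb}\right) + \left(\hat{\sigma}^{naep}_{sygb}\right)^2\right] + \left(\hat{\sigma}^{state}_{dygb}\right)^2 var\left(\hat{\sigma}^{naep}_{sygb}\right). \qquad (5)$$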

The linking equations (2 and 3) and the standard error formulas (equations 4 and 5) here are 
accurate under the assumption that there is no linking error—the assumption that the average 
performance of students in any given district relative to the average performance of students in their 
state would be the same on the NAEP and state assessments. This is a strong and untested assumption.
We next provide a set of validation analyses intended to assess the accuracy of this assumption. We then
provide modifications of the standard error formulas here that take the linking error into account.
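As an illustration of the linking step and its sampling variances under the no-linking-error assumption, the following Python sketch maps one district's state-standardized mean and standard deviation to the NAEP scale. The function and argument names are ours for illustration, and the numeric inputs are hypothetical; the variance functions use the standard result for the variance of a product of independent estimates.

```python
def link_mean(mu_state, mu_naep, sd_naep):
    """Linked district mean: state NAEP mean plus the district's
    standardized mean scaled by the state NAEP standard deviation."""
    return mu_naep + mu_state * sd_naep

def link_sd(sd_state, sd_naep):
    """Linked district standard deviation on the NAEP scale."""
    return sd_state * sd_naep

def var_linked_mean(mu_state, var_mu_state, var_mu_naep, sd_naep, var_sd_naep):
    """Sampling variance of the linked mean, assuming no linking error and
    independent estimates of the state-scale and NAEP parameters."""
    return (var_mu_naep
            + sd_naep ** 2 * var_mu_state
            + mu_state ** 2 * var_sd_naep
            + var_sd_naep * var_mu_state)

def var_linked_sd(sd_state, var_sd_state, sd_naep, var_sd_naep):
    """Sampling variance of the linked standard deviation under the
    same assumptions."""
    return var_sd_state * (var_sd_naep + sd_naep ** 2) + sd_state ** 2 * var_sd_naep

# Hypothetical district: 0.5 state SDs above its state mean, in a state
# with NAEP mean 250 and NAEP SD 30.
linked_mean = link_mean(0.5, 250.0, 30.0)
linked_mean_se = var_linked_mean(0.5, 0.01, 1.0, 30.0, 0.25) ** 0.5
```

In this hypothetical case the dominant variance term is the district's own sampling variance scaled by the squared NAEP standard deviation, which is typical when the state NAEP parameters are precisely estimated.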


Validation Checks and Results 

The linking method we use here, on its own, is based on the untested assumption that districts’ 
distributions of scores on the state accountability tests have the same relationship to one another (i.e., 
the same relative means and standard deviations) as they would if the NAEP assessment were 
administered in lieu of the state test. Implicit in this assumption is that the content, format,
and testing conditions of the state and NAEP tests do not differ in ways that substantially affect aggregate
relative distributions. This is, on its face, a strong assumption.


Rather than assert that this assumption is valid, we empirically assess it, using the patchwork
reporting and administration of district results by NAEP and NWEA. We do this in several ways. First, for
the districts participating in the NAEP TUDA assessments over these years, we compare $\hat{\mu}^{\widehat{naep}}_{dygb}$ (the
estimated district mean based on our linking method) to $\hat{\mu}^{naep}_{dygb}$ (the mean of NAEP TUDA scores from
the district). This provides a direct validation of the linking method, since the TUDA scores are in the
metric that the linking method attempts to recover but are not themselves used in any way in the linking
process. We repeat this linkage for demographic subgroups to assess the population invariance of the
link.

8 This is not strictly true, since $\hat{\mu}^{naep}_{sygb}$ and $\hat{\sigma}^{naep}_{sygb}$ are estimated from the same sample. However, the NAEP samples
are large within each state-year-grade-subject, so the covariance of the estimated means and standard deviations is
very small relative to other sources of sampling variance in Equation (4).

Second, we assess the correlation of our linked district estimates with district mean scores on the 
NWEA MAP tests. This provides the correlation across a larger sample of districts. However, the NWEA 
MAP test has a different score scale, so it does not provide direct comparability with the NAEP scale that 
is the target of our linking. 

Third, for the relevant TUDA districts, we assess whether within-district differences in linked 
scores across grades and cohorts correspond to those differences observed in the NAEP data. That is, we 
assess whether the linking provides accurate measures of changes in scores across grades and cohorts of 
students, in addition to providing accurate means in a given year. 

Fourth, we conduct a set of validation exercises designed to assess the validity of the 
interpolation of the NAEP scores in non-NAEP years and grades. For all of these analyses, we present 
evidence regarding the district means; corresponding results for the standard deviations are in the 


appendices. 


Validation Check 1: Recovery of TUDA means 




The NAEP TUDA data provide estimated means and standard deviations on the actual “naep” scale, $\hat{\mu}^{naep}_{dygb}$ and $\hat{\sigma}^{naep}_{dygb}$, for 17 large urban districts in 2009 and 20 in 2011 and 2013. For these particular large districts, we can compare the NAEP means and standard deviations to their linked means and standard deviations. For each district, we obtain discrepancies $\hat{\mu}^{\widehat{naep}}_{dygb} - \hat{\mu}^{naep}_{dygb}$ and $\hat{\sigma}^{\widehat{naep}}_{dygb} - \hat{\sigma}^{naep}_{dygb}$. If there were no sampling or measurement error in these estimates, we would report the average of these discrepancies as the bias, and would report the square root of the average squared discrepancies as the Root Mean Squared Error (RMSE). We could also report the observed correlation between the two as a measure of how well the linked estimates align linearly with their reported TUDA values. However, because of imprecision in both the NAEP TUDA and linked estimates, the RMSE will be inflated and the correlation will be attenuated as measures of recovery. Instead, we report measurement-error corrected RMSEs and correlations that account for imprecision in both the linked and TUDA parameter estimates.
To estimate the measurement-error corrected bias, RMSE, and correlation in a given year, grade, and subject, we fit the model below, using the sample of districts for which we have both estimates $\hat{\mu}^{\widehat{naep}}_{dygb}$ and $\hat{\mu}^{naep}_{dygb}$ (or $\hat{\sigma}^{\widehat{naep}}_{dygb}$ and $\hat{\sigma}^{naep}_{dygb}$ as the case may be; the model is the same for the means or standard deviations):

$$
\begin{aligned}
\hat{\mu}_{idygb} &= \alpha_{0dygb}(LINKED_i) + \alpha_{1dygb}(TUDA_i) + e_{idygb} \\
\alpha_{0dygb} &= \beta_{00} + u_{0dygb} \\
\alpha_{1dygb} &= \beta_{10} + u_{1dygb} \\
e_{idygb} &\sim N(0, \omega^2_{idygb}); \quad u_{dygb} \sim MVN(0, \tau),
\end{aligned}
$$

where $i$ indexes source (linked or NAEP TUDA test), $\omega^2_{idygb}$ is the estimated sampling variance (the squared standard error) of $\hat{\mu}_{idygb}$ (which we treat as known), and $\tau = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{bmatrix}$ is the variance-covariance matrix of the linked and TUDA parameter values, which must be estimated. Given the model estimates, we estimate the bias, $\hat{B} = \hat{\beta}_{00} - \hat{\beta}_{10}$, and $\widehat{RMSE} = [\hat{B}^2 + b\hat{\tau}b']^{1/2}$, where $b = [1 \;\; -1]$ is a 1 × 2 design matrix. Finally, we estimate the correlation of $\alpha_{0dygb}$ and $\alpha_{1dygb}$ as $\hat{r} = \hat{\tau}_{01} / \sqrt{\hat{\tau}_{00}\hat{\tau}_{11}}$.
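
As a rough illustration of how these summary statistics follow from the fitted parameters, the sketch below computes the bias, RMSE, and correlation from a pair of fixed intercepts and the elements of τ. The function name and all numerical inputs are hypothetical and chosen only for illustration; they are not estimates from our data, and the sketch does not fit the model itself.

```python
import math


def recovery_stats(beta_linked, beta_tuda, tau00, tau01, tau11):
    """Bias, RMSE, and correlation implied by fitted model parameters.

    beta_linked, beta_tuda: fixed intercepts for linked and TUDA values.
    tau00, tau01, tau11: elements of the 2 x 2 covariance matrix tau.
    """
    bias = beta_linked - beta_tuda
    # b * tau * b' with b = [1, -1] expands to tau00 - 2*tau01 + tau11,
    # the variance of the true (linked minus TUDA) discrepancy.
    var_diff = tau00 - 2.0 * tau01 + tau11
    rmse = math.sqrt(bias ** 2 + var_diff)
    corr = tau01 / math.sqrt(tau00 * tau11)
    return bias, rmse, corr


# Hypothetical inputs (NOT estimates from the paper)
bias, rmse, corr = recovery_stats(230.0, 228.0, 100.0, 96.0, 104.0)
print(round(bias, 2), round(rmse, 2), round(corr, 3))
```

Writing the quadratic form out by hand keeps the example free of matrix libraries; with a larger design vector, a general matrix product would be the more natural choice.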
Table 1 reports the results of these analyses in each subject, grade, and year in which we have TUDA estimates (see Table A1 for the corresponding table for standard deviations). Although we do not show the uncorrected estimates here, we note that the measurement error corrections have a negligible impact on bias and reduce the (inflated) RMSE by around 8% on average. On average, the linked estimates overestimate actual NAEP TUDA means by roughly 1.8 points on the NAEP scale, or around 0.05 of a standard deviation unit, assuming a NAEP scale standard deviation of 35 (NAEP standard deviations vary from roughly 30 to 40 across subjects, years, and grades). The bias is slightly greater in earlier years and in mathematics.


9 Note that this model assumes the errors $e_{idygb}$ are independent within each district-grade-year-subject. The error in the NAEP TUDA estimate arises because of sampling variance (because the NAEP assessment was given to only a random sample of students in each TUDA district). The error in the linked estimate arises because of a) error in the estimated district mean score, and b) sampling error in the NAEP estimates of the state mean and standard deviation (see Equation 4). The error in the estimated district mean arises from the fact that the HETOP model estimates the mean score from coarsened data, not from sampling variance (because the state assessments include the full population of students); the error in the NAEP state mean and standard deviation arises from sampling variance in the state NAEP samples. Both the coarsening error and the state sampling error are independent of the sampling error in the NAEP district mean estimate.
Table 1: Recovery of NAEP TUDA means following state-level linkage of state test score distributions to the NAEP scale, measurement error adjusted.

                                          Recovery
Subject     Grade      Year     n        RMSE    Bias    Correlation
Reading     4          2009     17       3.95    2.12    0.96
Reading     4          2011     20       3.69    1.25    0.96
Reading     4          2013     20       2.62    0.20    0.98
Reading     8          2009     17       2.92    1.12    0.95
Reading     8          2011     20       2.20    0.63    0.97
Reading     8          2013     20       3.62    1.67    0.93
Math        4          2009     17       6.09    4.10    0.93
Math        4          2011     20       4.97    2.60    0.94
Math        4          2013     20       3.60    1.46    0.95
Math        8          2009     14       5.21    3.40    0.95
Math        8          2011     17       3.77    2.09    0.96
Math        8          2013     14       4.54    1.47    0.94
Average     2009                14-17    4.70    2.69    0.95
Average     2011                17-20    3.79    1.64    0.96
Average     2013                14-20    3.66    1.20    0.95
Average     Reading             17-20    3.23    1.17    0.96
Average     Math                14-20    4.77    2.52    0.95
Average     All                 14-20    4.07    1.84    0.95
Subgroup    Male                14-20    4.14    1.84    0.97
Subgroup    Female              14-20    3.95    1.70    0.98
Subgroup    White               11-19    3.89    0.66    0.98
Subgroup    Black               13-19    4.11    1.80    0.96
Subgroup    Hispanic            12-20    4.07    2.08    0.94

Source: Authors’ calculations from EDFacts and NAEP TUDA Expanded Population Estimates data. RMSE and bias are measured in NAEP scale score points. Estimates are based on Equation 7 in text. Subgroup averages are computed from a model that pools across grades and years within subject (like Equation 9 in text); the subject averages are then pooled within subgroup.


This positive bias indicates that the average scores of students in the TUDA districts sit systematically higher in the statewide distribution of scores on the state accountability tests than in the corresponding distribution on the NAEP test. This leads to a higher-than-expected NAEP mapping. Table 1 also shows that the average estimated precision-adjusted correlation (disattenuated to account for the imprecision in the observed means) is 0.95 (note that the simple unadjusted correlation is 0.94; measurement error in the means is minor relative to the true variation in the means of the TUDA districts). Figure 1 shows scatterplots of the estimated linked means versus the observed TUDA means, separately for grades and subjects, with the identity lines displayed as a reference.


Figure 1. Comparison of reported means from NAEP TUDA and NAEP-linked state test score distributions, 
grades 4 and 8, Reading and Mathematics, in 2009, 2011, and 2013. 


Note that under a linear linking such as Equation 2, our definition of bias implies that the weighted average bias, among all districts within each state, and across all states, is 0 by design. If we had all districts, the bias in Table 1 would be 0; it is not 0 because Table 1 summarizes the bias for only the subset of NAEP urban districts for which we have scores. The RMSE similarly describes the magnitude of error (the square root of average squared error) for these districts and may be larger or smaller than the RMSE for other districts in the state.

We review here four possible explanations for discrepancies between a district’s average scores 
on the state accountability test and on the NAEP assessments. These are not meant to be exhaustive 
explanations; they illustrate possible substantive reasons for linking error and variance. First, the 
population of students assessed in the two instances may differ. For example, a positive discrepancy may 
result if the target district excluded low scoring students from state tests but not from NAEP. If this 
differential exclusion were greater in the target district, on average, than in other districts in the state, 
the target district would appear higher in the state test score distribution than it would in the NAEP score 
distribution, leading to a positive discrepancy between the district’s linked mean score and its NAEP mean 
scores. Likewise, a positive discrepancy would result if the NAEP assessments excluded high scoring 
students more in the TUDA assessment than in the statewide assessment, or if there were differential 
exclusion of high-scoring students in other districts on the state test relative to the target district and no 
differential exclusion on NAEP. In other words, the discrepancies might result from a target district’s 
scores being biased upward on the state test or downward on the NAEP assessment relative to other 
districts in the state, and/or from other districts’ scores being biased downward on the state test or 
upward on the NAEP assessment relative to the target district. 

Second, the discrepancies may result from differential content in NAEP and state tests. If a 
district’s position in the state distribution of skills/knowledge measured by the state test does not match 
its position in the statewide distribution of skills measured by the NAEP assessment, the linked scores will 




not match those on NAEP. The systematic positive discrepancies in Table 1 and Figure 1 may indicate that 
students in the TUDA districts have disproportionately higher true skills in the content areas measured by 
their state tests than the NAEP assessments relative to other districts in the states. In other words, if large 
districts are better than other districts in their states at teaching their students the specific content 
measured by state tests, relative to their effectiveness in teaching the skills measured by NAEP, we would 
see a pattern of positive discrepancies like that in Table 1 and Figure 1. 

Third, relatedly, students in the districts with a positive discrepancy may have relatively high 
motivation for state tests over NAEP, compared to other districts. Fourth, the bias evident in Table 1 and 
Figure 1 may indicate differential score inflation or outright cheating. For example, some of the largest 
positive discrepancies among the 20 TUDA districts illustrated in Figure 1 are in Atlanta in 2009, where 
there was systematic cheating on the state test in 2009 (Wilson, Bowers, & Hyde, 2011). The 
discrepancies in the Atlanta estimates are substantially smaller (commensurate with other large districts) 
in 2011 and 2013, after the cheating had been discovered. In this way, we see that many possible sources 
of bias in the linking are sources of bias with district scores on the state test itself, rather than problems 
with the linking per se. 


We also address the population invariance of the linking (e.g., Kolen & Brennan, 2014; Dorans & Holland, 2000) by reporting the average direction and magnitude (RMSE) of discrepancies, $\hat{\mu}^{\widehat{naep}}_{dygb} - \hat{\mu}^{naep}_{dygb}$, for selected gender and racial/ethnic subgroups in Table 1. The number of districts is lower in some grade-year cells due to insufficient subgroup samples in some districts.¹⁰ The subgroup RMSEs (3.89 to 4.14 points) are close to the RMSE for all students (4.07), and the bias is positive and of broadly similar magnitude for all groups (0.66 to 2.08 points). We conclude from these comparable values that the linking functions recover NAEP district means similarly, on average, across subgroups.

10 Our model for subgroups pools across grades (4 and 8) and years (2009, 2011, and 2013) to compensate for smaller numbers of districts in some grade-year cells. We describe this model in Validation Check 3 below. On average across grades and years, the results are similar to a model that does not use pooling. We also calculate bias and RMSE for Asian student populations but do not report them due to small numbers of districts: 5-10 per cell. However, bias and RMSE of linked district estimates were higher for Asians, suggesting caution against conducting a separate linkage for Asian students.


Validation Check 2: Association with NWEA MAP means 

The NWEA MAP test is administered in thousands of school districts across the country. Because 
the MAP tests are scored on the same scale nationwide, district average MAP scores can serve as a 
second audit test against which we can compare the linked scores. As noted previously, in most tested 
districts, the number of student test scores is very close to the district’s enrollment in the same subject, 
grade, and year. For these districts, we estimate means and standard deviations on the scale of the MAP 
test. The scale differs from that of NAEP, so absolute discrepancies are not interpretable. However, 
strong correlations between linked district means and standard deviations and those on MAP represent 
convergent evidence that the linking is appropriate. 


We calculate disattenuated correlations between the observed MAP means and both the HETOP estimated means (prior to linking them to the NAEP scale) and the linked means from Equation 2. The improvement from the correlation of MAP means and HETOP estimates to the correlation of MAP means and NAEP-linked estimates is due solely to the move from the “state” to the “naep” scale, shifting all districts within each state according to NAEP performance.
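
To make the idea of disattenuation concrete, the sketch below applies the classical correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. This is a textbook simplification offered as an assumption; it is not necessarily the precision-adjusted estimator used for the reported results. The observed correlation of .87 is the value noted for grade 4 math in 2009, but the two reliabilities are hypothetical.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation of an observed correlation."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Observed correlation from the grade 4 math, 2009 example; the
# reliabilities of the two sets of district means are HYPOTHETICAL.
r_adjusted = disattenuate(0.87, 0.93, 0.94)
print(round(r_adjusted, 2))
```

The correction grows as either reliability falls, which is why imprecisely estimated district means can noticeably understate the underlying association.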

Table 2 shows that correlations between the linked district means and MAP district means are 
0.93 on average when adjusting for imprecision (see Table A2 for the corresponding table for standard 
deviations). This is larger than the average correlation of 0.87 between the MAP means and the 
(unlinked) HETOP estimates. Figure 2 shows a bubble plot of district MAP scores on linked scores for 
Grade 4 mathematics in 2009, as an illustration of the data underlying these correlations. Note that the 
points plotted in Figure 2 are means estimated with imprecision. The observed (attenuated) correlations 


are generally .03 to .10 points lower than their disattenuated counterparts. 
Table 2: Precision-adjusted correlations of linked district means with NWEA MAP district means before and after state-level linkage of state test score distributions to the NAEP scale.

                                           Precision-Adjusted Correlations
Subject     Grade    Year         N        With HETOP     With Linked
                                           Estimates      Estimates
Reading     4        2009         1139     0.90           0.95
Reading     4        2011         1472     0.87           0.93
Reading     4        2013         1843     0.92           0.95
Reading     8        2009         959      0.84           0.91
Reading     8        2011         1273     0.87           0.91
Reading     8        2013         1597     0.88           0.92
Math        4        2009         1128     0.86           0.93
Math        4        2011         1467     0.82           0.90
Math        4        2013         1841     0.87           0.93
Math        8        2009         970      0.83           0.93
Math        8        2011         1279     0.84           0.92
Math        8        2013         1545     0.87           0.95
Average              2009         4196     0.86           0.93
Average              2011         5491     0.85           0.91
Average              2013         6826     0.88           0.94
Average              All Years    16513    0.87           0.93

Source: Authors’ calculations from EDFacts and NWEA MAP data. NWEA MAP = Northwest Evaluation Association Measures of Academic Progress. Sample includes districts with reported NWEA MAP scores for at least 90% of students.


Figure 2. Example of an association between linked means and NWEA MAP means, grade 4 math, 2009.
Note: Correlation of .87; precision-adjusted correlation of .93. Bubble size corresponds to district enrollment.


Validation Check 3: Association of between-grade and -cohort trends 

An additional assessment of the extent to which the linked state district means match the 
corresponding NAEP district means compares not just the means in a given grade and year, but the 
within-district differences in means across grades and years. If the discrepancies evident in Figure 1 are 
consistent across years and grades within a district, then the linked state estimates will provide accurate 
measures of the within-district trends across years and grades, even when there is a small bias in the
average means.

To assess the accuracy of the across-grade and -year differences in linked mean scores, we use 
data from the 20 TUDA districts from the grades and years in which we have both linked means and 
corresponding means from NAEP. We do not use the NAEP data from interpolated years and grades in 


this model. We fit the same model for both means and standard deviations, and separately by subject. For each subject, we fit a precision-weighted random coefficients model of this form:

$$
\begin{aligned}
\hat{\mu}_{idygb} &= \alpha_{0dygb}(LINKED_i) + \alpha_{1dygb}(TUDA_i) + e_{idygb} \\
\alpha_{0dygb} &= \beta_{00d} + \beta_{01d}(year_{dygb} - 2011) + \beta_{02d}(grade_{dygb} - 6) + u_{0dygb} \\
\alpha_{1dygb} &= \beta_{10d} + \beta_{11d}(year_{dygb} - 2011) + \beta_{12d}(grade_{dygb} - 6) + u_{1dygb} \\
\beta_{00d} &= \gamma_{00} + v_{00d} \\
\beta_{01d} &= \gamma_{01} + v_{01d} \\
\beta_{02d} &= \gamma_{02} + v_{02d} \\
\beta_{10d} &= \gamma_{10} + v_{10d} \\
\beta_{11d} &= \gamma_{11} + v_{11d} \\
\beta_{12d} &= \gamma_{12} + v_{12d} \\
e_{idygb} &\sim N(0, \omega^2_{idygb}); \quad u_{dygb} \sim MVN(0, \sigma^2); \quad v_d \sim MVN(0, \tau),
\end{aligned}
\tag{7}
$$

where $i$ indexes source (linked or NAEP TUDA test) and $\omega^2_{idygb}$ is the sampling variance of $\hat{\mu}_{idygb}$ (which we treat as known and set equal to the square of the estimated standard error of $\hat{\mu}_{idygb}$). The vector $\Gamma = \{\gamma_{00}, \ldots, \gamma_{12}\}$ contains the average intercepts, year slopes, and grade slopes (in the second subscript, 0, 1, and 2, respectively) for the linked values and the target values (in the first subscript, 0 and 1, respectively). The differences between the corresponding elements of $\Gamma$ indicate average bias (i.e., the difference between $\gamma_{00}$ and $\gamma_{10}$ indicates the average deviation of the linked means from the NAEP TUDA means, net of district-specific grade and year trends). Unlike Table 1 above, where we estimated bias separately for each year and grade and descriptively averaged them, the bias here is estimated by pooling over all years and grades of TUDA data, with district random effects. If the linking were perfect, we would expect this to be 0.


The matrix of random parameters $\tau$ includes, on the diagonal, the between-district variances of the average district means and their grade and year trends; the off-diagonal elements are their covariances. From $\tau$ we can compute the correlation between the within-district differences in mean scores between grades and years. The correlation $corr(v_{01d}, v_{11d})$, for example, describes the correlation between the year-to-year trend in district NAEP scores and the trend in the linked scores. Likewise, the correlation $corr(v_{02d}, v_{12d})$ describes the correlation between the grade 4-8 differences in district NAEP scores and the corresponding difference in the linked scores. Finally, the correlation $corr(v_{00d}, v_{10d})$ describes the correlation between the NAEP and linked intercepts in the model, that is, the correlation between linked and TUDA mean scores. This correlation differs from that shown in Table 1 above because Table 1 estimates the correlation separately for each grade and year, whereas the model in Equation 7 estimates the correlation from a model in which all years and grades are pooled.
Table 3 shows the results of fitting this model separately by subject to the district means (see 
Table A3 for the corresponding table for standard deviations). When comparing the linked estimates to 
the NAEP TUDA estimates, several patterns are evident. First, the estimated correlation of the TUDA and 
linked intercepts is 0.98 (for both math and reading) and the bias in the means (the difference in the 
estimated intercepts in Table 3) is small and not statistically significant. The linked reading means are, on
average, 1.1 points higher (s.e. of the difference is 3.0; n.s.) than the TUDA means, and the linked
mathematics means are, on average, 2.4 points higher (s.e. of the difference is 3.3; n.s.) than the TUDA
means. These are, not surprisingly, similar to the average bias estimated from each year and grade
separately and shown in Table 1.
Table 3. Estimated comparison of linked and TUDA district means, pooled across grades and years, by subject.

                                            Reading         Math
Linked EDFacts Parameters
  Intercept (γ00)                           228.53 ***      250.59 ***
                                            (2.00)          (2.10)
  Year (γ01)                                0.91 ***        0.43 *
                                            (0.17)          (0.19)
  Grade (γ02)                               10.81 ***       9.58 ***
                                            (0.27)          (0.29)
TUDA Parameters
  Intercept (γ10)                           227.41 ***      248.14 ***
                                            (2.12)          (2.49)
  Year (γ11)                                1.03 ***        0.91 ***
                                            (0.10)          (0.11)
  Grade (γ12)                               10.84 ***       9.68 ***
                                            (0.22)          (0.17)
L2 Intercept SDs - Linked (σ0)              2.51            2.66
L2 Intercept SDs - TUDA (σ1)                0.82            1.26
Correlation - L2 Residuals                  1.00            0.36
L3 Intercept SDs - Linked (τ1)              8.87            9.27
L3 Intercept SDs - TUDA (τ4)                9.79            11.10
L3 Year Slope SDs - Linked (τ2)             --              --
L3 Year Slope SDs - TUDA (τ5)               --              --
L3 Grade Slope SDs - Linked (τ3)            1.06            1.03
L3 Grade Slope SDs - TUDA (τ6)              0.93            0.61
Correlation - L3 Intercepts                 0.98            0.98
Correlation - L3 Year Slopes                --              --
Correlation - L3 Grade Slopes               0.85            0.98
Reliability L3 Intercept - Linked           0.98            0.98
Reliability L3 Year Slope - Linked          --              --
Reliability L3 Grade Slope - Linked         0.76            0.73
Reliability L3 Intercept - TUDA             1.00            1.00
Reliability L3 Year Slope - TUDA            --              --
Reliability L3 Grade Slope - TUDA           0.87            0.72
N - Observations                            228             204
N - Districts                               20              20

Source: Authors’ calculations from EDFacts and NAEP TUDA Expanded Population Estimates data. Estimates are based on Equation 9 in text. Note: the level 3 random errors on the year slope were not statistically significant, and so were dropped from the model. L2 = “Level 2”; L3 = “Level 3”.


Second, the estimated average linked and TUDA grade slopes ($\hat{\gamma}_{02}$ and $\hat{\gamma}_{12}$, respectively) are nearly identical to one another in both math and reading. The estimated bias in grade slopes (-0.04 in reading and -0.10 in math) is only 1% as large as the average grade slope. The implied RMSE from the model is 0.56 in reading and 0.46 in math, roughly 5% of the average grade slope.¹¹ The estimated correlation of the TUDA and linked grade slopes is 0.85 for reading and 0.98 for math. Finally, the reliability of the grade slopes of the linked estimates is 0.76 in reading and 0.73 in math.¹² Together these indicate that the linked estimates provide unbiased estimates of the within-district differences across grades, and that these estimates are precise enough to carry meaningful information about between-grade differences.

Third, there is little or no variation in the year trends in the TUDA districts; for both math and 
reading, the estimated variation of year trends is small and not statistically significant. As a result, neither 
the TUDA nor the linked estimates provide estimates of trends across years that are sufficiently reliable to 
be useful (in models not shown, we estimate the reliabilities of the TUDA year trends to be 0.28 and 0.53 
and of the linked year trends to be 0.45 and 0.72 in reading and math, respectively). As a result, we 
dropped the random effects on the year trends and do not report in Table 3 estimates of the variance, 


correlation, or reliability of the year trends. 


Validation Check 4: Recovery of estimates under interpolation between years and grades 

Using the interpolated state means and standard deviations in Equation (1) for the linking 
establishes an assumption that the linkage recovers district scores that would have been reported in 
years 2010 and 2012 and grades 3, 5, 6, and 7. Although we cannot assess recovery of linkages in 
interpolated grades with only grades 4 and 8, we can check recovery for an interpolated year, specifically, 


2011, between 2009 and 2013. By pretending that we do not have 2011 NAEP state data, we can assess 


11 We compute the RMSE of the grade slope from the model estimates as follows. Let $\hat{C} = \hat{\gamma}_{02} - \hat{\gamma}_{12}$ be the bias in the grade slopes; then the RMSE of the grade slope will be: $\widehat{RMSE} = [\hat{C}^2 + d\hat{\tau}d']^{1/2}$, where $d = [0 \;\; 0 \;\; 1 \;\; 0 \;\; 0 \;\; -1]$.
12 The reliability of the level-3 slopes and intercepts is computed as described in Raudenbush and Bryk (2002).




performance of our interpolation approach by comparing linked estimates to actual 2011 TUDA results. 
For each of the TUDAs that participated in both 2009 and 2013, we interpolate, for example,

$$
\hat{\mu}^{naep'}_{s2011gb} = \frac{1}{2}\left(\hat{\mu}^{naep}_{s2009gb} + \hat{\mu}^{naep}_{s2013gb}\right)
$$

$$
\hat{\sigma}^{naep'}_{s2011gb} = \frac{1}{2}\left(\hat{\sigma}^{naep}_{s2009gb} + \hat{\sigma}^{naep}_{s2013gb}\right)
$$

Applying Equations 2-5, we obtain estimates, for example, $\hat{\mu}^{\widehat{naep}'}_{d2011gb}$, and we compare these to actual TUDA estimates from 2011. We estimate bias and RMSE for discrepancies $\hat{\mu}^{\widehat{naep}'}_{d2011gb} - \hat{\mu}^{naep}_{d2011gb}$ using the
model from Validation Check 1. Table 4 shows results in the same format as Table 1 (see Table A4 for the
corresponding table for standard deviations). We note that the average RMSE of 3.7 and bias of 1.6 in
Table 4 are approximately the same as the average RMSE of 3.8 and bias of 1.6 shown for 2011 in Table 1.
Note that the interpolations we use in our proposed linking are those between observed scores that are 
only two years apart, rather than four years apart as in the validation exercise here. The two-year 
interpolations should be more accurate than the four-year interpolation, which itself is accurate enough 
to show no degradation in our recovery of estimated means. We conclude that the between-year 
interpolation of state NAEP scores adds no appreciable error to the linked estimates for TUDA districts. 
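
The interpolation step can be sketched as follows. The example linearly interpolates a state's NAEP mean and standard deviation for 2011 from 2009 and 2013 values, then applies a linear link of the form "state NAEP mean plus district standardized mean times state NAEP standard deviation". The function names and every number are hypothetical, and the link is written under the assumption that the district mean is expressed in within-state standard deviation units.

```python
def interpolate(year, year0, value0, year1, value1):
    """Linearly interpolate a state NAEP parameter between two tested years."""
    weight = (year - year0) / (year1 - year0)
    return (1.0 - weight) * value0 + weight * value1


def link_mean(state_naep_mean, state_naep_sd, district_std_mean):
    """Place a district mean (standardized within state) on the NAEP scale."""
    return state_naep_mean + district_std_mean * state_naep_sd


# Hypothetical state NAEP parameters for 2009 and 2013 (not real data)
mean_2011 = interpolate(2011, 2009, 220.0, 2013, 224.0)
sd_2011 = interpolate(2011, 2009, 36.0, 2013, 34.0)

# Hypothetical district mean, 0.5 state SDs above the state average
linked_2011 = link_mean(mean_2011, sd_2011, 0.5)
print(mean_2011, sd_2011, linked_2011)
```

For the midpoint year the weight is one half, which reproduces the simple average of the 2009 and 2013 values; the same function covers the two-year interpolations (2010 and 2012) by changing the endpoints.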


Table 4. Recovery of reported 2011 NAEP TUDA means following state-level linkage of state test score distributions to a NAEP scale interpolated between 2009 and 2013, measurement error adjusted.

                                        Recovery
Subject     Grade      Year     N        RMSE    Bias    Correlation
Reading     4          2011     20       3.78    0.89    0.95
Reading     8          2011     20       2.14    1.47    0.99
Math        4          2011     20       4.67    2.25    0.94
Math        8          2011     14       3.81    1.66    0.96
Average     Reading             20       3.07    1.18    0.97
Average     Math                14-20    4.26    1.95    0.95
Average     All                 14-20    3.72    1.57    0.96

Note: Using NAEP Expanded Population Estimates. Adjusted correlations account for imprecision in linked and target estimates.
We next investigate the viability of interpolation by comparing correlations of linked district estimates with MAP scores at different degrees of interpolation. Some grade-year combinations need no interpolation, others are singly interpolated, and others are doubly interpolated. Table 5 shows that, on average, precision-adjusted correlations between linked NAEP means and MAP means are almost identical across different degrees of interpolation, around 0.93 (see Table A5 for the corresponding table for standard deviations). This lends additional evidence that interpolation adds negligible aggregate error to recovery between years as well as grades.


Table 5: Precision-adjusted correlations between NWEA MAP district means and NAEP-linked estimates.

Subject     Grade    2009    2010    2011    2012    2013
Reading     3        0.95    0.94    0.93    0.93    0.94
Reading     4        0.95    0.94    0.93    0.94    0.95
Reading     5        0.94    0.94    0.93    0.93    0.94
Reading     6        0.92    0.94    0.92    0.93    0.93
Reading     7        0.92    0.93    0.92    0.92    0.92
Reading     8        0.91    0.91    0.91    0.91    0.92
Math        3        0.91    0.89    0.91    0.91    0.91
Math        4        0.93    0.92    0.90    0.92    0.93
Math        5        0.91    0.90    0.92    0.91    0.93
Math        6        0.93    0.93    0.94    0.93    0.94
Math        7        0.94    0.95    0.95    0.95    0.95
Math        8        0.93    0.93    0.92    0.94    0.95

Average correlation by degree of interpolation
No interpolation        0.93
Single interpolation    0.93
Double interpolation    0.93

Note. Linked using NAEP Expanded Population Estimates. NWEA MAP = Northwest Evaluation Association Measures of Academic Progress. Sample includes districts with reported NWEA MAP scores for at least 90% of students.


Quantifying the Uncertainty in Linked Estimates 


The validation analyses above suggest that the linked estimates correspond quite closely to their target values on average. But, as is evident in Table 1 and Figure 1, the degree of discrepancy varies among districts. NAEP and state assessments do not locate districts identically within states, implying linking error in the estimates. In this section, we provide a framework for quantifying the magnitude of this linking error. This in turn allows us to construct approximate standard errors for cross-state comparisons based on the linking.

We begin by rewriting Equation (2), omitting the ygb subscripts and using superscripts n and s in place of “naep” and “state” for parsimony. Here we distinguish the linked mean for a district d (estimated using Equation (2), and denoted here as $\hat{\mu}^{link}_d$) from the mean NAEP score we would observe if all students in a district took the NAEP assessment (this is the target parameter, denoted here as $\mu^n_d$):

$$
\begin{aligned}
\hat{\mu}^{link}_d &= \hat{\mu}^n_s + \hat{\mu}^s_d \cdot \hat{\sigma}^n_s \\
&= (\mu^n_s + v^n_s) + (\mu^s_d + v^s_d)(\sigma^n_s + w^n_s) \\
&= (\mu^n_s + \mu^s_d\sigma^n_s) + (\mu^s_d w^n_s + \sigma^n_s v^s_d + v^s_d w^n_s + v^n_s) \\
&= \mu^{link}_d + (\mu^s_d w^n_s + \sigma^n_s v^s_d + v^s_d w^n_s + v^n_s) \\
&= \mu^n_d + \delta_d + (\mu^s_d w^n_s + \sigma^n_s v^s_d + v^s_d w^n_s + v^n_s).
\end{aligned}
\tag{10}
$$


Equation (10) shows that the estimated linked mean has two sources of error: linking error ($\delta_d$) and sampling/estimation error (the term in parentheses), which results from error in the estimated state mean and standard deviation on the NAEP assessment and from estimation error in the district mean on the state assessment. The variance of the sampling/estimation error term is given by Equation (4) above. The variance of the linking error is not known for the full population, but we can illustrate its magnitude using the RMSEs for TUDAs reported in Table 1. The average root mean squared linking error in Table 1 is about 4 points on the NAEP scale. Because the standard deviation of national NAEP scores within a grade, year, and subject is typically 28-38 points, the RMSE is roughly 4/33 ≈ 0.12 student-level standard deviation units. To the extent that the RMSE of the linking errors across the TUDA districts is representative of the standard deviation of the linking errors in the full population, we can then approximate the standard deviation of $\delta_d$ as 0.12.

We can now compute approximate standard errors of the linked estimates that take into account linking error as well as sampling and estimation error. The variance of the linked mean will be

$$
var(\hat{\mu}^{link}_d) = var(\delta_d) + (\mu^s_d)^2 var(w^n_s) + (\sigma^n_s)^2 var(v^s_d) + var(v^s_d) \cdot var(w^n_s) + var(v^n_s).
\tag{11}
$$

We have estimates of $\sigma^n_s$, $var(v^n_s)$, and $var(w^n_s)$ from NAEP; we have estimates of $\mu^s_d$ and $var(v^s_d)$ from the state assessment data; and we have an approximation of $var(\delta_d)$ from Table 1 above. Using appropriate estimates of these parameters, we can compute standard errors of the linked estimates from Equation (11).
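
The variance formula for the linked mean can be evaluated directly. The sketch below is a minimal illustration in standardized (national student-level SD) units: the default parameter values are the approximate empirical medians mentioned in the text, the district standard error of 0.10 is the approximate median, and the function name is hypothetical.

```python
import math


def linked_mean_se(sd_delta, sd_v_district, mu_district=0.0, sigma_state=0.98,
                   var_v_state=0.001, var_w_state=0.0004):
    """Approximate standard error of a linked district mean.

    sd_delta: assumed SD of the linking error.
    sd_v_district: standard error of the district mean on the state test.
    Remaining arguments: district standardized mean, state NAEP SD, and the
    sampling variances of the state NAEP mean and SD.
    """
    var_v_district = sd_v_district ** 2
    variance = (sd_delta ** 2
                + mu_district ** 2 * var_w_state
                + sigma_state ** 2 * var_v_district
                + var_v_district * var_w_state
                + var_v_state)
    return math.sqrt(variance)


# Standard error at the median district SE under four linking-error scenarios
se_by_scenario = {s: linked_mean_se(s, 0.10) for s in (0.0, 0.06, 0.12, 0.18)}
for sd_delta, se in se_by_scenario.items():
    print(sd_delta, round(se, 3))
```

Because the linking-error variance enters additively, its share of the total is largest when the district's own sampling error is small.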

How large is the linking error relative to the other sources of error in the linked estimate? We can get a sense of this from Figure 3, which shows the standard error of the linked estimate as a function of the magnitude of the standard error of the state assessment mean, $sd(v^s_d)$, and the linking error standard deviation, $sd(\delta_d)$. The standard errors of the state assessment means in our data range from roughly 0.01 to 0.25, with a median of roughly 0.10 (the CDF of the distribution of standard errors is shown in Figure 3 for reference). The figure includes four lines describing the standard error of the linked means under four different assumptions about the magnitude of the linking error: $sd(\delta_d) \in \{0, 0.06, 0.12, 0.18\}$. The scenario $sd(\delta_d) = 0$ corresponds to the assumption that there is no linking error (an unrealistic assumption, but useful as a reference). The TUDA analyses in Table 1 suggest a value of $sd(\delta_d) = 0.12$; we include higher and lower values for $sd(\delta_d)$ in order to describe the plausible range. The other terms in the variance formula for the linked mean are held constant at values near their empirical medians: $\mu^s_d = 0$; $\sigma^n_s = 0.98$; $var(v^n_s) = 0.001$; and $var(w^n_s) = 0.0004$. The computed standard errors are very insensitive to these terms across the full range of their empirical values in the NAEP and state assessment data we use.

Figure 3. Reliabilities and standard errors (in NAEP national standard deviation units) of linked means on the NAEP scale across a range of specifications.

Figure 3 shows that the standard errors of the linked estimates are meaningfully larger when we take into account the linking error than when we assume it is 0, but this is more true for districts with small standard errors on the state assessment. In terms of standard errors, linking error plays a larger role when other sources of error are small.

Figure 3 also describes the implied reliability of the linked estimates. We estimate the standard 
deviation of the district means (excluding measurement error) across all districts to be roughly 0.34 
national student-level standard deviations. Using this, we compute the reliability as 

$$ r = \frac{0.34^2}{0.34^2 + \mathrm{var}(\hat{\mu}^{link}_{d})}. $$

Under the assumption of linking error with $\mathrm{sd}(\delta_d) = 0.12$, the reliability of the linked 
means is lower than if we assume no linking error (as expected). However, it is still about 0.70 or higher 
for the roughly 91% of districts with state standard errors less than 0.19. This suggests that, even in the 
presence of linking error, there is enough signal in the linked estimates to be broadly useful for 
distinguishing among districts. 
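
The calculation behind Figure 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the figure: we assume that the error variance of a linked mean is the sum of the linking error variance, the sampling variance of the state NAEP mean, the squared standardized district mean times the sampling variance of the state NAEP standard deviation, and the squared state NAEP standard deviation times the sampling variance of the district mean (the form implied by Equations 12 to 14, with the small product term ignored). The function names are hypothetical, the constants are the median values quoted above, and 0.34 is treated as a standard deviation, so the true variance is 0.34 squared.

```python
import math

# Assumed constants, taken from the values quoted in the text.
MU_STATE = 0.0        # standardized district mean on the state test
SIGMA_NAEP = 0.98     # state NAEP standard deviation
VAR_V_NAEP = 0.001    # sampling variance of the state NAEP mean
VAR_W_NAEP = 0.0004   # sampling variance of the state NAEP standard deviation
SD_TRUE = 0.34        # SD of true district means, in national SD units


def linked_se(se_state, sd_link):
    """Standard error of a linked mean under the assumed additive form."""
    variance = (
        sd_link ** 2                         # linking error variance
        + VAR_V_NAEP                         # error in the state NAEP mean
        + MU_STATE ** 2 * VAR_W_NAEP         # error in the state NAEP SD
        + SIGMA_NAEP ** 2 * se_state ** 2    # error in the district mean
    )
    return math.sqrt(variance)


def reliability(se_state, sd_link):
    """True between-district variance as a share of total variance."""
    true_var = SD_TRUE ** 2
    return true_var / (true_var + linked_se(se_state, sd_link) ** 2)


# One row per linking-error scenario, one column per state standard error.
for sd_link in (0.0, 0.06, 0.12, 0.18):
    row = [f"{reliability(se, sd_link):.2f}" for se in (0.01, 0.10, 0.19, 0.25)]
    print(f"sd(delta) = {sd_link:.2f}: reliabilities {row}")
```

Under these assumptions, `reliability(0.19, 0.12)` evaluates to roughly 0.70, which agrees with the value reported in the text. Adding linking error inflates the standard error proportionally more for districts with small state standard errors.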


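We can also compute standard errors of the difference between two districts’ means. The 
formula will differ for districts in the same state vs. different states. For two districts in the same state: 

$$
\mathrm{var}\left(\hat{\mu}^{link}_{d1} - \hat{\mu}^{link}_{d2}\right)
= 2\,\mathrm{var}(\delta)
+ \left(\hat{\mu}^{s}_{d1} - \hat{\mu}^{s}_{d2}\right)^2 \mathrm{var}(w^{n}_{s})
+ \left(\hat{\sigma}^{n}_{s}\right)^2 \left[\mathrm{var}(v^{s}_{d1}) + \mathrm{var}(v^{s}_{d2})\right].
\tag{12}
$$

For two districts in different states, however: 

$$
\begin{aligned}
\mathrm{var}\left(\hat{\mu}^{link}_{d1} - \hat{\mu}^{link}_{d2}\right)
= {} & 2\,\mathrm{var}(\delta)
+ \left(\hat{\mu}^{s}_{d1}\right)^2 \mathrm{var}(w^{n}_{s1})
+ \left(\hat{\mu}^{s}_{d2}\right)^2 \mathrm{var}(w^{n}_{s2})
+ \left(\hat{\sigma}^{n}_{s1}\right)^2 \mathrm{var}(v^{s}_{d1}) \\
& + \left(\hat{\sigma}^{n}_{s2}\right)^2 \mathrm{var}(v^{s}_{d2})
+ \mathrm{var}(v^{n}_{s1}) + \mathrm{var}(v^{n}_{s2}).
\end{aligned}
\tag{13}
$$

Assuming the sampling variances of the state means and standard deviations are the same in 
both states (which is approximately true in this case, given that NAEP sample sizes and standard 
deviations are similar among states), this is: 

$$
\mathrm{var}\left(\hat{\mu}^{link}_{d1} - \hat{\mu}^{link}_{d2}\right)
\approx 2\,\mathrm{var}(\delta)
+ \left[\left(\hat{\mu}^{s}_{d1}\right)^2 + \left(\hat{\mu}^{s}_{d2}\right)^2\right] \mathrm{var}(w^{n})
+ \left(\hat{\sigma}^{n}\right)^2 \left[\mathrm{var}(v^{s}_{d1}) + \mathrm{var}(v^{s}_{d2})\right]
+ 2\,\mathrm{var}(v^{n}).
\tag{14}
$$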

We ignore the $\mathrm{var}(v^{s}_{d}) \cdot \mathrm{var}(w^{n})$ terms because they are very small 
relative to the other terms in the formula. The difference in the variance of a between-state and a 
within-state comparison, holding all other terms constant, will be 
$2\hat{\mu}^{s}_{d1}\hat{\mu}^{s}_{d2}\,\mathrm{var}(w^{n}) + 2\,\mathrm{var}(v^{n})$. The same-state and 
different-state formulas differ because the within-state comparisons share the same sampling error in 
the state NAEP means and standard deviations. As a result, there is generally more uncertainty in 
between-state comparisons than within-state comparisons on the NAEP scale. However, the difference is 
generally small. Both between-state and within-state comparisons share the linking error variance 
($2\,\mathrm{var}(\delta)$) and the sampling/estimation error variance in the district means 
($\mathrm{var}(v^{s}_{d1}) + \mathrm{var}(v^{s}_{d2})$). In addition, $\mathrm{var}(w^{n})$ and 
$\mathrm{var}(v^{n})$ are small relative to these two sources of error in the case we examine here. The 
result is that the errors in within- and between-state differences in linked means are generally similar.¹³ 
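
The comparison between Equations (12) and (14) can be checked numerically with the scenario described in footnote 13. The sketch below is illustrative only. The function names are hypothetical, and we assume a state NAEP standard deviation of 1.0 together with var(w) = 0.0004 and var(v) = 0.001 for the state NAEP terms, because those values reproduce the variances quoted in the footnote.

```python
def within_state_var(mu1, mu2, var_d1, var_d2, var_delta,
                     sigma_naep=1.0, var_w=0.0004):
    """Error variance of a same-state difference, following Equation (12)."""
    return (
        2 * var_delta                          # linking error in both districts
        + (mu1 - mu2) ** 2 * var_w             # shared error in the state NAEP SD
        + sigma_naep ** 2 * (var_d1 + var_d2)  # error in the two district means
    )


def between_state_var(mu1, mu2, var_d1, var_d2, var_delta,
                      sigma_naep=1.0, var_w=0.0004, var_v=0.001):
    """Error variance of a cross-state difference, following Equation (14)."""
    return (
        2 * var_delta
        + (mu1 ** 2 + mu2 ** 2) * var_w        # separate NAEP SD errors per state
        + sigma_naep ** 2 * (var_d1 + var_d2)
        + 2 * var_v                            # separate NAEP mean errors per state
    )


# Footnote 13 scenario: both districts two SDs above their state means,
# district error variances of 0.0025, linking error variance of 0.0036.
within = within_state_var(2.0, 2.0, 0.0025, 0.0025, 0.0036)
between = between_state_var(2.0, 2.0, 0.0025, 0.0025, 0.0036)
gap = between - within

print(f"within-state variance:  {within:.4f}")
print(f"between-state variance: {between:.4f}")
print(f"difference:             {gap:.4f}")
```

With these inputs the two functions return 0.0122 and 0.0174, matching the footnote, and the gap of 0.0052 equals 2(2)(2)(0.0004) + 2(0.001), the difference term given in the text.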


Discussion 

We present validation methods for aggregate-level linking, motivated by the goal of constructing 
a U.S.-wide district-level dataset of test score means and standard deviations. We demonstrate that test 
score distributions on state standardized tests can be linked to a national NAEP-linked scale in a way that 
yields district-level distributions that correspond well—but not perfectly—to the absolute performance of 
districts on NAEP and the relative performance of districts on MAP. The correlation of district-level mean 
scores on the NAEP-linked scale with scores on the NAEP TUDA and NWEA MAP assessments is generally 
high (averaging 0.95 and 0.93, respectively, across grades, years, and subjects). Nonetheless, we find 
some evidence that NAEP-linked estimates include a small but systematically positive bias in large urban districts 
(roughly +0.05 standard deviations, on average). This implies a corresponding small downward bias for 
other districts in the same states, on average. 

Are these discrepancies a threat to the validity of the linked estimates of district means? The 
answer depends on how the estimates will be used. Given evidence of the imperfect correlation and 
small bias, the linked estimates should not be used to compare or rank school districts’ performance 
when the estimated means are close and when the districts are in different states. As we noted, there are 
several possible sources of error in a cross-state comparison, including differences in content, motivation, 
sampling, and inflation. Our methods cannot identify the presence of any one type of error, but do allow 
us to quantify the total amount of error in cross-state comparisons. This error is small relative to the 
distribution of test scores and the variation in average district scores. Of course, relative comparisons 
within states do not depend on the linking procedure, so these are immune to bias and variance that 
arise from the linking methods. 

13 To illustrate this, suppose we assume a low value for the linking error variance (say 
$\mathrm{var}(\delta) = 0.0036$, one-quarter of what is implied by the RMSE of the NAEP TUDA means). Let us 
compare two districts with small standard errors (say 
$\mathrm{var}(\hat{\mu}^{s}_{d1}) = \mathrm{var}(\hat{\mu}^{s}_{d2}) = 0.0025$, which is smaller than for 90% 
of districts; see Figure 3). If both districts have estimated means two standard deviations from their state 
means (so $\hat{\mu}^{s}_{d1} = \hat{\mu}^{s}_{d2} = 2$, an extreme case), then Equation (12) indicates that 
the error in a within-state comparison will have variance of 0.0122, while Equation (14) indicates that a 
between-state comparison will have a modestly larger error variance of 0.0174. In most comparisons (where 
the error variances of the estimated district means are larger, where the district means are not so far from 
their state averages, or when the linking error variance is larger), the difference in the error variance of 
between- and within-state comparisons will be much smaller. 

On the basis of these results, we believe the linked estimates are accurate enough to be used to 
investigate broad patterns in the relationships between average test performance and local community 
or schooling conditions, both within and between states. The validation exercises suggest that the linked 
estimates can be used to examine variation among districts and across grades within districts. It is unclear 
whether the estimates provide unbiased estimates of within-grade trends over time, given that there is 
little or no variation in the NAEP district trends over time against which to benchmark the linked trend 
estimates. This is true more generally even of within-grade national NAEP trends, which are often 
underpowered to detect true progress over shorter time spans of 2 to 4 years. 

Validation methods must begin with an intended interpretation or use of scores (Kane, 2013). An 
operational interpretation of the linked aggregate estimates is that they are the result of monotonic transformations of 
district score distributions on state tests. They are state score distributions with NAEP-based adjustments, 
with credit given for being in a state with relatively high NAEP performance and, for districts within the 
states, greater discrimination among districts when a state’s NAEP standard deviation is high. Our 
contribution is to provide a strategy and methods for answering the essential counterfactual question: 
What would district results have been, had district scores on NAEP or MAP been available? When 
patchwork administrations of district tests are available, we can obtain direct and indirect answers to this 
question. 
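
The operational interpretation above can be made concrete with a small sketch. It assumes the linkage takes a linear form, in which a district mean standardized within its state is multiplied by the state's NAEP standard deviation and shifted by the state's NAEP mean. That form is inferred from the error terms in Equations (12) to (14) and is not restated in this section. The function names and all numeric values below are hypothetical.

```python
def link_district_mean(mu_state_std, naep_state_mean, naep_state_sd):
    """Map a within-state standardized district mean onto the NAEP scale.

    Assumed linear form: linked mean = state NAEP mean + state NAEP SD * z.
    """
    return naep_state_mean + naep_state_sd * mu_state_std


def link_district_sd(sd_state_std, naep_state_sd):
    """Rescale a within-state standardized district SD to the NAEP scale."""
    return naep_state_sd * sd_state_std


# Hypothetical states, in national student-level SD units.
state_a = {"naep_mean": 0.20, "naep_sd": 1.05}   # higher-scoring, more dispersed
state_b = {"naep_mean": -0.10, "naep_sd": 0.90}  # lower-scoring, less dispersed

# Two districts, each half a state SD above its own state mean.
linked_a = link_district_mean(0.5, state_a["naep_mean"], state_a["naep_sd"])
linked_b = link_district_mean(0.5, state_b["naep_mean"], state_b["naep_sd"])

# Spread between districts at +0.5 and -0.5 state SDs within each state.
spread_a = linked_a - link_district_mean(-0.5, state_a["naep_mean"], state_a["naep_sd"])
spread_b = linked_b - link_district_mean(-0.5, state_b["naep_mean"], state_b["naep_sd"])

print(f"linked means: A = {linked_a:.3f}, B = {linked_b:.3f}")
print(f"within-state spreads: A = {spread_a:.3f}, B = {spread_b:.3f}")
```

In this hypothetical case the district in state A receives credit for its state's higher NAEP mean, and districts in state A are spread further apart on the linked scale because that state's NAEP standard deviation is larger.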

Because the testing conditions, purpose, motivation, and content of NAEP and state tests differ, 
we find that district results do differ across tests. But our validation checks suggest that these differences 
are generally small relative to the variation among districts. This is evident in the high correspondence of 
the linked and NAEP TUDA estimates and of the linked and NWEA MAP estimates. This suggests that our 
set of estimated NAEP-linked district test score results may be useful in empirical research describing and 
analyzing national variation in local academic performance. When data structures with patchwork 
administrations of tests are available in other U.S. and international testing contexts, our strategy and 
methods are a roadmap not only to link scores at the aggregate level but also to validate interpretations 
and uses of linked scores. 

Appendix Tables 


Table A1: Recovery of NAEP TUDA standard deviations following state-level linkage of state test score 
distributions to the NAEP scale, measurement error adjusted. 

                                            Recovery
Subject     Grade   Year        n        RMSE          Bias     Correlation
Reading     4       2009        17       0.89          -1.31    0.76
Reading     4       2011        20       2.28          -0.14    0.46
Reading     4       2013        20       1.07           0.34    0.96
Reading     8       2009        17       1.84          -1.33    0.88
Reading     8       2011        20       1.80          -1.00    0.41
Reading     8       2013        20       1.01          -1.27    0.62
Math        4       2009        17       1.40          -0.09    0.72
Math        4       2011        20       [illegible]    0.83    0.68
Math        4       2013        20       1.64           0.99    0.71
Math        8       2009        14       2.15          -0.62    0.77
Math        8       2011        17       1.86           0.24    0.79
Math        8       2013        14       2.49           0.15    0.56
Average             2009        14-17    1.64          -0.84    0.78
Average             2011        17-20    1.93          -0.02    0.59
Average             2013        14-20    1.66           0.05    0.71
Average             Reading     17-20    1.57          -0.79    0.68
Average             Math        14-20    1.91           0.25    0.71
Average             All         14-20    1.75          -0.27    0.69
Subgroup            Male        14-20    1.60          -0.13    0.72
Subgroup            Female      14-20    1.76           0.05    0.65
Subgroup            White       11-19    4.43           2.99    0.42
Subgroup            Black       13-19    1.90          -0.24    0.40
Subgroup            Hispanic    12-20    2.51          -1.24    0.34

Source: Authors’ calculations from EDFacts and NAEP TUDA Expanded Population 
Estimates data. Estimates are based on Equation 7 in text. Subgroup averages are 
computed from a model that pools across grades and years within subject (like Equation 
9 in text); the subject averages are then pooled within subgroup. 


Table A2: Precision-adjusted correlations of linked district standard deviations with NWEA MAP district 
standard deviations before and after state-level linkage of state test score distributions to the NAEP scale. 

                                            Precision-Adjusted Correlations
Subject     Grade   Year         n          With HETOP Estimates    With Linked Estimates
Reading     4       2009         1139       0.52                    0.58
Reading     4       2011         1472       0.55                    0.61
Reading     4       2013         1843       0.59                    0.64
Reading     8       2009         959        0.54                    0.57
Reading     8       2011         1273       0.51                    0.58
Reading     8       2013         1597       0.52                    0.51
Math        4       2009         1128       0.65                    0.75
Math        4       2011         1467       0.58                    0.66
Math        4       2013         1841       0.63                    0.69
Math        8       2009         970        0.65                    0.70
Math        8       2011         1279       0.55                    0.62
Math        8       2013         1545       0.65                    0.70
Average             2009         4196       0.59                    0.65
Average             2011         5491       0.55                    0.62
Average             2013         6826       0.60                    0.64
Average             All Years    16513      0.58                    0.63

Source: Authors’ calculations from EDFacts and NWEA MAP data. NWEA MAP = Northwest 
Evaluation Association Measures of Academic Progress. Sample includes districts with reported 
NWEA MAP scores for at least 90% of students. 


Table A3. Estimated comparison of linked and TUDA district standard deviations, pooled across grades 
and years, by subject. 

                                              Reading         Math
Linked EDFacts Parameters
  Intercept (γ00)                             36.21 ***       33.17 ***
                                              (0.42)          (0.35)
  Year (γ01)                                  0.18 +          0.39 ***
                                              (0.10)          (0.11)
  Grade (γ02)                                 -0.68 ***       1.66 ***
                                              (0.13)          (0.18)
TUDA Parameters
  Intercept (γ10)                             36.85 ***       32.75 ***
                                              (0.39)          (0.36)
  Year (γ11)                                  0.07            0.17 *
                                              (0.10)          (0.08)
  Grade (γ12)                                 -0.38 **        1.81 ***
                                              (0.13)          (0.14)
L2 Intercept Variance - Linked (σ0²)          0.96            1.34
L2 Intercept Variance - TUDA (σ1²)            0.74            0.43
Correlation - L2 Residuals                    1.00            1.00
L3 Intercept Variance - Linked (τ1²)          1.73            1.30
L3 Intercept Variance - TUDA (τ4²)            1.60            1.43
L3 Year Slope Variance - Linked (τ2²)         --              --
L3 Year Slope Variance - TUDA (τ5²)           --              --
L3 Grade Slope Variance - Linked (τ3²)        0.44            0.67
L3 Grade Slope Variance - TUDA (τ6²)          0.46            0.54
Correlation - L3 Intercepts                   0.60            0.70
Correlation - L3 Year Slopes                  --              --
Correlation - L3 Grade Slopes                 0.77            0.89
Reliability L3 Intercept - Linked             0.86            0.76
Reliability L3 Year Slope - Linked            --              --
Reliability L3 Grade Slope - Linked           0.63            0.77
Reliability L3 Intercept - TUDA               0.84            0.87
Reliability L3 Year Slope - TUDA              --              --
Reliability L3 Grade Slope - TUDA             0.63            0.79
N - Observations                              228             204
N - Districts                                 20              20

Source: Authors’ calculations from EDFacts and NAEP TUDA Expanded 
Population Estimates data. Estimates are based on Equation 9 in text. Note: the 
level 3 random errors on the year slope were not statistically significant, and so 
were dropped from the model. L2 = “Level 2”; L3 = “Level 3”. 


Table A4. Recovery of reported 2011 NAEP TUDA standard deviations following state-level linkage of state 
test score distributions to a NAEP scale interpolated between 2009 and 2013, measurement error 
adjusted. 

                                          Recovery
Subject     Grade      Year    n        RMSE    Bias     Correlation
Reading     4          2011    20       2.12     0.39    0.53
Reading     8          2011    20       2.53    -0.70    0.00
Math        4          2011    20       2.15     1.43    0.53
Math        8          2011    14       1.52     0.65    0.93
Average     Reading            20       2.34    -0.15    0.26
Average     Math               14-20    1.86     1.04    0.73
Average     All                14-20    2.11     0.44    0.50

Note: Using NAEP Expanded Population Estimates. Adjusted correlations account for 
imprecision in linked and target estimates. 


Table A5: Precision-adjusted correlations between NWEA MAP district standard deviations and NAEP-linked 
estimates. 

Subject     Grade    2009    2010    2011    2012    2013
Reading     3        0.53    0.57    0.63    0.62    0.61
Reading     4        0.58    0.57    0.61    0.62    0.64
Reading     5        0.61    0.54    0.58    0.64    0.64
Reading     6        0.60    0.58    0.61    0.64    0.57
Reading     7        0.60    0.57    0.56    0.55    0.53
Reading     8        0.57    0.56    0.58    0.52    0.51
Math        3        0.71    0.70    0.69    0.67    0.65
Math        4        0.75    0.77    0.66    0.71    0.69
Math        5        0.73    0.69    0.73    0.72    0.74
Math        6        0.73    0.75    0.69    0.71    0.73
Math        7        0.75    0.65    0.70    0.71    0.70
Math        8        0.70    0.64    0.62    0.71    0.70

No interpolation        0.63
Single interpolation    0.58
Double interpolation    0.64

Note. Linked using NAEP Expanded Population Estimates. NWEA MAP = Northwest Evaluation Association Measures 
of Academic Progress. Sample includes districts with reported NWEA MAP scores for at least 90% of students. 